My goal: create a heatmap of Baltimore City crime data, in an effort to determine where NOT to live.
I plan to create a Google Fusion Table because it has heatmap functionality built in, and I might be able to easily overlay it on the housing map I already have (which would really speed things up).
Here my goal is to prep the data for the fusion table. It needs geocodes for approximate location (which it already has), and I want to take advantage of the optional weights to better represent the dangers as I personally perceive them.
library(tidyverse) # read_csv, dplyr verbs, stringr, purrr
library(lubridate) # today(), year(), month(), %m-%
library(caret)     # train()
library(vcd)       # assocstats()
library(rattle)    # fancyRpartPlot()
dl_date <- Sys.Date()
save_file <- gsub("-", "", dl_date) %>%
paste0("~/Housing/", ., "-BPD_Part1_crime.csv")
BC_crime_url <- "https://data.baltimorecity.gov/api/views/wsfq-mvij/rows.csv?accessType=DOWNLOAD"
download.file(BC_crime_url, destfile = save_file)
crime <- read_csv(save_file, col_types =
cols(
CrimeDate = col_date(format = "%m/%d/%Y"),
CrimeTime = col_character(), #problems with time format
CrimeCode = col_factor(levels = NULL),
Location = col_character(),
Description = col_factor(levels = NULL),
`Inside/Outside` = col_factor(levels = NULL),
Weapon = col_factor(levels = NULL),
Post = col_integer(),
District = col_character(),
Neighborhood = col_character(),
Longitude = col_double(),
Latitude = col_double(),
`Location 1` = col_character(),
Premise = col_character(),
`Total Incidents` = col_integer()
)
)
problems(crime)
Just realized that time doesn’t really matter for my purposes. I’m just interested in the date… but, hey, no harm in practicing!
time_err_ind <- !grepl(":", crime$CrimeTime)
sum(time_err_ind) # number of errors
[1] 4324
time_err <- crime$CrimeTime[time_err_ind]
Examine the time errors:
head(time_err, 50)
[1] "1218" "2014" "1929" "1859" "1859" "1840" "0245" "1856" "1834" "1834" "0257"
[12] "1658" "1658" "1658" "1545" "1620" "2100" "2010" "1956" "1326" "2013" "1710"
[23] "1710" "1710" "1706" "1050" "2324" "2208" "1857" "2118" "1316" "2115" "1835"
[34] "1200" "1123" "2126" "1902" "1457" "1312" "1226" "2228" "1601" "2208" "0322"
[45] "0322" "0322" "0211" "0055" "1639" "1639"
# string length of errors
str_len <- nchar(time_err)
table(str_len)
str_len
3 4 5
1 4322 1
time_err[str_len != 4]
[1] "1826h" "242"
Don’t strip the seconds; there are unique values besides ':00'.
# any seconds besides '00' in correctly formatted values
sub("(.*)(\\:[0-9]{2}$)", "\\2", crime$CrimeTime) %>%
unique() %>%
head(50)
[1] ":00" "1218" "2014" "1929" "1859" "1840" "0245" "1856" "1834" ":13" "0257"
[12] "1658" "1545" "1620" ":09" ":10" "2100" "2010" "1956" ":33" "1326" ":50"
[23] "2013" "1710" "1706" ":43" ":11" "1050" ":49" "2324" ":22" ":03" ":56"
[34] ":38" "2208" "1857" ":32" ":46" "2118" ":07" "1316" ":19" ":54" "2115"
[45] "1835" "1200" ":31" "1123" "2126" "1902"
The plan: strip the “h” from the length-5 error, add a “0” to the front of the length-3 error, and then add a “:” to the middle of every time plus “:00” at the end to make the time format uniform.
time_adj <- crime$CrimeTime
time_adj <- gsub("h", "", time_adj)
unique(nchar(time_adj))
[1] 8 4 5 3
Still a length-5 value?
time_adj[nchar(time_adj) == 5]
[1] "12:27"
It has a colon, so it’s fine. I’ll just process all the length-4 values in two steps (insert the colon, then append seconds), which will capture it along the way.
Fix the length-3 issue:
time_adj <- if_else(nchar(time_adj) == 3, paste0("0", time_adj), time_adj)
unique(nchar(time_adj))
[1] 8 4 5
Adjust all the length-4 values:
time_adj <- if_else(nchar(time_adj) == 4,
sub("([0-9]{2})([0-9]{2})", "\\1:\\2", time_adj),
time_adj)
unique(nchar(time_adj))
[1] 8 5
time_adj[nchar(time_adj) == 5] %>% head(50) #looks good, add secs
[1] "12:18" "20:14" "19:29" "18:59" "18:59" "18:40" "02:45" "18:56" "18:34" "18:34"
[11] "02:57" "16:58" "16:58" "16:58" "15:45" "16:20" "21:00" "20:10" "19:56" "13:26"
[21] "20:13" "17:10" "17:10" "17:10" "17:06" "10:50" "23:24" "22:08" "18:57" "21:18"
[31] "13:16" "21:15" "18:35" "12:00" "11:23" "21:26" "19:02" "14:57" "13:12" "12:26"
[41] "22:28" "16:01" "22:08" "03:22" "03:22" "03:22" "02:11" "00:55" "16:39" "16:39"
time_adj <- if_else(nchar(time_adj) == 5,
paste0(time_adj, ":00"),
time_adj)
unique(nchar(time_adj))
[1] 8
# reformat as a time-of-day value
crime <- mutate(crime, CrimeTime = parse_time(time_adj, format = "%H:%M:%S"))
head(crime$CrimeTime, 20)
23:55:00
23:37:00
23:00:00
22:25:00
21:56:00
21:50:00
21:30:00
21:15:00
21:03:00
21:00:00
21:00:00
20:36:00
20:30:00
20:24:00
20:15:00
20:00:00
19:14:00
19:00:00
18:51:00
18:50:00
What data is available?
str(crime)
Classes tbl_df, tbl and 'data.frame': 251676 obs. of 15 variables:
$ CrimeDate : Date, format: "2018-03-03" "2018-03-03" ...
$ CrimeTime :Classes 'hms', 'difftime' atomic [1:251676] 86100 85020 82800 80700 78960 ...
.. ..- attr(*, "units")= chr "secs"
$ CrimeCode : Factor w/ 81 levels "6J","4C","4E",..: 1 2 3 3 3 2 4 3 5 6 ...
$ Location : chr "1000 ASHLAND CT" "2800 W NORTH AVE" "1400 E FORT AVE" "800 E JEFFREY ST" ...
$ Description : Factor w/ 15 levels "LARCENY","AGG. ASSAULT",..: 1 2 3 3 3 2 4 3 5 2 ...
$ Inside/Outside : Factor w/ 4 levels "I","O","Outside",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Weapon : Factor w/ 4 levels "OTHER","HANDS",..: NA 1 2 2 2 1 NA 2 3 2 ...
$ Post : int 312 811 943 913 111 415 323 841 721 426 ...
$ District : chr "EASTERN" "SOUTHWESTERN" "SOUTHERN" "SOUTHERN" ...
$ Neighborhood : chr "Oldtown" "Walbrook" "Locust Point" "Brooklyn" ...
$ Longitude : num -76.6 -76.7 -76.6 -76.6 -76.6 ...
$ Latitude : num 39.3 39.3 39.3 39.2 39.3 ...
$ Location 1 : chr "(39.30019, -76.60352)" "(39.30923, -76.66498)" "(39.26944, -76.59492)" "(39.23236, -76.60041)" ...
$ Premise : chr "ROW/TOWNHO" "ROW/TOWNHO" "BARBER/BEA" "ROW/TOWNHO" ...
$ Total Incidents: int 1 1 1 1 1 1 1 1 1 1 ...
I don’t know what the crime code means, and a quick Google search didn’t help me figure it out, so I’ll just leave it alone.
The helpful variables will be ‘Description’ and ‘Weapon’. I don’t know whether a crime being inside or outside is worse, and whether it happened in a house or a business doesn’t matter much to me because the proximity to danger is the same, so Premise doesn’t matter either. I wonder if there are any Total Incidents greater than 1. That might matter.
sum(crime$`Total Incidents` > 1)
[1] 0
sum(crime$`Total Incidents` != 1)
[1] 0
‘Total Incidents’ is always 1 so that’s useless.
So, I’ll want my final weight to be some combination of Description and Weapon adjusted for time (because it’s less relevant if it happened years ago).
The more dangerous something is to my family the worse it is. So, I’ll weight those higher. I also want to protect my car if possible, preferentially from being stolen completely and then from damage. I’m less worried about stuff being stolen from it because I can leave it somewhat empty (minus the emergency stuff).
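As a quick sketch of the scheme I’m aiming for (with purely hypothetical numbers; the actual weight tables are worked out below), the overall weight is the crime and weapon components summed, then scaled by a time factor:

```r
# hypothetical example of the intended weighting scheme
crime_component  <- 7    # severity of the crime type (e.g. a violent crime)
weapon_component <- 7    # severity of the weapon (e.g. a firearm)
time_component   <- 0.5  # decays toward 0 as events get older
(crime_component + weapon_component) * time_component
# [1] 7
```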
levels(crime$Description) %>% sort()
[1] "AGG. ASSAULT" "ARSON" "ASSAULT BY THREAT"
[4] "AUTO THEFT" "BURGLARY" "COMMON ASSAULT"
[7] "HOMICIDE" "LARCENY" "LARCENY FROM AUTO"
[10] "RAPE" "ROBBERY - CARJACKING" "ROBBERY - COMMERCIAL"
[13] "ROBBERY - RESIDENCE" "ROBBERY - STREET" "SHOOTING"
count(crime, Description) %>% arrange(desc(n))
levels(crime$Weapon)
[1] "OTHER" "HANDS" "FIREARM" "KNIFE"
I’m not sure what ‘OTHER’ weapon refers to.
count(crime, Description, Weapon) %>%
spread(key = Weapon, value = n) %>%
arrange(desc(FIREARM), desc(KNIFE), desc(OTHER))
Based on comparison to firearms and knives it looks like ‘Other’ is weapons worse than ‘Hands’, so I’ll weight them in that order.
If I were to order the types of crime alone, larceny would be the least of my worries. Larceny comes with living in a city. We’ll deal with it.
There’s going to be some relatedness between the type of weapon and the crime. I’ll quickly replace any NA weapon values with 'NONE' and then examine the relatedness of Weapon and Description using a decision tree and some correlation calcs before I decide on final weights and how to organize them.
sum(is.na(crime$Description))
[1] 0
sum(is.na(crime$Weapon))
[1] 164181
crime <- mutate(crime,
Weapon = as.factor(
if_else(is.na(Weapon), "NONE", as.character(Weapon))
)
)
str(crime)
Classes tbl_df, tbl and 'data.frame': 251676 obs. of 15 variables:
$ CrimeDate : Date, format: "2018-03-03" "2018-03-03" ...
$ CrimeTime :Classes 'hms', 'difftime' atomic [1:251676] 86100 85020 82800 80700 78960 ...
.. ..- attr(*, "units")= chr "secs"
$ CrimeCode : Factor w/ 81 levels "6J","4C","4E",..: 1 2 3 3 3 2 4 3 5 6 ...
$ Location : chr "1000 ASHLAND CT" "2800 W NORTH AVE" "1400 E FORT AVE" "800 E JEFFREY ST" ...
$ Description : Factor w/ 15 levels "LARCENY","AGG. ASSAULT",..: 1 2 3 3 3 2 4 3 5 2 ...
$ Inside/Outside : Factor w/ 4 levels "I","O","Outside",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Weapon : Factor w/ 5 levels "FIREARM","HANDS",..: 4 5 2 2 2 5 4 2 1 2 ...
$ Post : int 312 811 943 913 111 415 323 841 721 426 ...
$ District : chr "EASTERN" "SOUTHWESTERN" "SOUTHERN" "SOUTHERN" ...
$ Neighborhood : chr "Oldtown" "Walbrook" "Locust Point" "Brooklyn" ...
$ Longitude : num -76.6 -76.7 -76.6 -76.6 -76.6 ...
$ Latitude : num 39.3 39.3 39.3 39.2 39.3 ...
$ Location 1 : chr "(39.30019, -76.60352)" "(39.30923, -76.66498)" "(39.26944, -76.59492)" "(39.23236, -76.60041)" ...
$ Premise : chr "ROW/TOWNHO" "ROW/TOWNHO" "BARBER/BEA" "ROW/TOWNHO" ...
$ Total Incidents: int 1 1 1 1 1 1 1 1 1 1 ...
# Weapon ~ Description was not what I wanted
crime_tree <- train(Description ~ Weapon, data = crime, method = "rpart")
crime_tree
CART
251676 samples
1 predictor
15 classes: 'LARCENY', 'AGG. ASSAULT', 'COMMON ASSAULT', 'ROBBERY - RESIDENCE', 'ROBBERY - COMMERCIAL', 'ROBBERY - STREET', 'LARCENY FROM AUTO', 'BURGLARY', 'AUTO THEFT', 'ROBBERY - CARJACKING', 'ASSAULT BY THREAT', 'ARSON', 'SHOOTING', 'HOMICIDE', 'RAPE'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 251676, 251676, 251676, 251676, 251676, 251676, ...
Resampling results across tuning parameters:
cp Accuracy Kappa
0.005695894 0.4692203 0.3481855
0.112349226 0.4015323 0.2494283
0.204077489 0.3128910 0.1278410
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.005695894.
fancyRpartPlot(crime_tree$finalModel)
The plot shows the predicted class, the predicted probability of each class, and the percentage of observations in each node (but I don’t know the order). Oh well, that turned out to be less useful than I thought, since the model only took the HANDS and NONE weapon levels into account.
Let’s calculate the strength of association using Cramér’s V and the vcd package.
assocstats(table(crime$Description, crime$Weapon))
X^2 df P(> X^2)
Likelihood Ratio 413153 56 0
Pearson 510717 56 0
Phi-Coefficient : NA
Contingency Coeff.: 0.818
Cramer's V : 0.712
chisq.test(crime$Description, crime$Weapon)
Pearson's Chi-squared test
data: crime$Description and crime$Weapon
X-squared = 510720, df = 56, p-value < 2.2e-16
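As a sanity check, Cramér’s V can be recomputed by hand from the Pearson chi-squared statistic, assuming the standard formula \(V = \sqrt{X^2 / (n(k - 1))}\) where \(k\) is the smaller of the two factors’ level counts (5 for Weapon here):

```r
# recompute Cramér's V from the assocstats() numbers above
x2 <- 510717     # Pearson X^2
n  <- 251676     # number of observations
k  <- min(15, 5) # min(levels of Description, levels of Weapon)
sqrt(x2 / (n * (k - 1)))  # ~0.712, matching the assocstats() value
```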
Description and Weapon weights
For Description I will weight things as follows:
For Weapon I will weight things as follows:
These 2 weights will be added to create a final Crime-Weapon weight and adjusted so that the lowest value is 1 (i.e. has no special weight).
crime_wt <- c(LARCENY = 1, 'LARCENY FROM AUTO' = 1, ARSON = 1,
'ASSAULT BY THREAT' = 1, BURGLARY = 2, 'ROBBERY - COMMERCIAL' = 2,
'COMMON ASSAULT' = 4.5, 'AUTO THEFT' = 4.5, 'ROBBERY - STREET' = 4.5,
'ROBBERY - CARJACKING' = 6, 'ROBBERY - RESIDENCE' = 6, SHOOTING = 7,
RAPE = 7, 'AGG. ASSAULT' = 7, HOMICIDE = 7)
crime_wt <- as.data.frame(crime_wt) %>%
rownames_to_column(var = "Description")
wep_wt <- c(NONE = 0, HANDS = 0, OTHER = 3, KNIFE = 5, FIREARM = 7)
wep_wt <- as.data.frame(wep_wt) %>%
rownames_to_column(var = "Weapon")
crime <- left_join(crime, crime_wt, by = "Description") %>%
left_join(wep_wt, by = "Weapon")
Column `Description` joining factor and character vector, coercing into character vector
Column `Weapon` joining factor and character vector, coercing into character vector
str(crime)
Classes tbl_df, tbl and 'data.frame': 251676 obs. of 17 variables:
$ CrimeDate : Date, format: "2018-03-03" "2018-03-03" ...
$ CrimeTime :Classes 'hms', 'difftime' atomic [1:251676] 86100 85020 82800 80700 78960 ...
.. ..- attr(*, "units")= chr "secs"
$ CrimeCode : Factor w/ 81 levels "6J","4C","4E",..: 1 2 3 3 3 2 4 3 5 6 ...
$ Location : chr "1000 ASHLAND CT" "2800 W NORTH AVE" "1400 E FORT AVE" "800 E JEFFREY ST" ...
$ Description : chr "LARCENY" "AGG. ASSAULT" "COMMON ASSAULT" "COMMON ASSAULT" ...
$ Inside/Outside : Factor w/ 4 levels "I","O","Outside",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Weapon : chr "NONE" "OTHER" "HANDS" "HANDS" ...
$ Post : int 312 811 943 913 111 415 323 841 721 426 ...
$ District : chr "EASTERN" "SOUTHWESTERN" "SOUTHERN" "SOUTHERN" ...
$ Neighborhood : chr "Oldtown" "Walbrook" "Locust Point" "Brooklyn" ...
$ Longitude : num -76.6 -76.7 -76.6 -76.6 -76.6 ...
$ Latitude : num 39.3 39.3 39.3 39.2 39.3 ...
$ Location 1 : chr "(39.30019, -76.60352)" "(39.30923, -76.66498)" "(39.26944, -76.59492)" "(39.23236, -76.60041)" ...
$ Premise : chr "ROW/TOWNHO" "ROW/TOWNHO" "BARBER/BEA" "ROW/TOWNHO" ...
$ Total Incidents: int 1 1 1 1 1 1 1 1 1 1 ...
$ crime_wt : num 1 7 4.5 4.5 4.5 7 6 4.5 2 7 ...
$ wep_wt : num 0 3 0 0 0 3 0 0 7 0 ...
Events that happened years ago should be weighted less. I need to determine how many events happen in a year to have some idea of how to weight them.
tmp <- tibble(year = year(crime$CrimeDate), month = month(crime$CrimeDate))
# crimes/year
table(tmp$year)
2013 2014 2015 2016 2017 2018
49568 45973 48858 48847 51753 6677
# mean(crimes)/month
table(tmp$year, tmp$month) %>%
.[. > 500] %>% # removes current and future months (with low counts)
mean()
[1] 4055.839
That’s a lot of crime in a month and a year. I guess the best way to do it is to select a period that will not be down-weighted and then gradually down-weight by month (I may have to iterate on this). In the last six months there would have been about \(2.433 \times 10^{4}\) crimes. The area of Baltimore City is about 92.1 sq miles, so that’s about 264 crimes per square mile, which doesn’t seem too heavy.
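For the record, the back-of-envelope math behind those figures (using the monthly mean computed above and Baltimore City’s roughly 92.1 sq mi area):

```r
# rough crime density over the last six months
monthly_mean <- 4055.839          # mean crimes/month from the table above
six_months   <- monthly_mean * 6  # ~24,335 crimes
six_months / 92.1                 # ~264 crimes per square mile
```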
So anything within the last 6 months will be our \(final\_wt * 1\). There are 63 months in the data set, so we’ll just develop a sequence from this month back toward zero.
# split into 63 periods starting at today
t <- map_dbl(0:63, ~today() %m-% months(.x)) %>% as_date()
# oldest date between last 2 values?
nth(t, -2) > min(crime$CrimeDate) & min(crime$CrimeDate) > last(t)
[1] TRUE
# weights by month
month_wt <- seq(1, 0, along.with = t)
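To see what `along.with` is doing here: it just matches the length of the output to `t` (64 values), so the weights step down from 1 toward 0 in increments of 1/63. A minimal illustration:

```r
# seq() with along.with produces one weight per element of the reference vector
w <- seq(1, 0, along.with = 0:63)
length(w)   # 64
head(w, 3)  # 1.0000000 0.9841270 0.9682540
tail(w, 1)  # effectively 0
```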
Now add this to crime:
# SET time_wt as double with length = crime observations
t_wt <- numeric(nrow(crime))
# FOR each month division, m
# IF date is < upper month date AND date >= lower month date THEN
# SET weight equal to weight for this division
# ELSE
# DO nothing
# END FOR
for (n in seq_len(length(t) - 1)) { # stop one short so t[n + 1] stays in bounds
t_wt <- if_else(
crime$CrimeDate < t[n] & crime$CrimeDate >= t[n + 1],
month_wt[n],
t_wt
)
}
# verify correct
unique(t_wt)
[1] 1.00000000 0.96825397 0.87301587 0.80952381 0.77777778 0.98412698 0.95238095
[8] 0.93650794 0.92063492 0.90476190 0.88888889 0.85714286 0.84126984 0.82539683
[15] 0.79365079 0.76190476 0.74603175 0.73015873 0.71428571 0.69841270 0.68253968
[22] 0.66666667 0.65079365 0.63492063 0.61904762 0.60317460 0.58730159 0.57142857
[29] 0.55555556 0.53968254 0.52380952 0.19047619 0.17460317 0.50793651 0.49206349
[36] 0.47619048 0.46031746 0.44444444 0.42857143 0.41269841 0.39682540 0.38095238
[43] 0.36507937 0.34920635 0.33333333 0.31746032 0.30158730 0.28571429 0.26984127
[50] 0.25396825 0.23809524 0.22222222 0.20634921 0.15873016 0.14285714 0.12698413
[57] 0.11111111 0.09523810 0.07936508 0.06349206 0.04761905 0.03174603 0.01587302
length(t_wt) == nrow(crime)
[1] TRUE
crime <- mutate(crime,
time_wt = t_wt,
overall_wt = (crime_wt + wep_wt) * time_wt)
# verify no zero vals
range(crime$overall_wt)
[1] 0.01587302 14.00000000
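That range lines up with the weight tables: the maximum is the worst crime plus a firearm at full time weight, and the minimum is the lightest crime with no weapon at the oldest (smallest) time weight of 1/63. Recomputing the expected endpoints:

```r
# expected endpoints of overall_wt, rebuilt from the weight tables above
max_wt <- (7 + 7) * 1       # e.g. HOMICIDE + FIREARM, most recent period
min_wt <- (1 + 0) * (1/63)  # e.g. LARCENY + NONE, oldest period
c(min_wt, max_wt)
# [1]  0.01587302 14.00000000
```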
str(crime)
Classes tbl_df, tbl and 'data.frame': 251676 obs. of 19 variables:
$ CrimeDate : Date, format: "2018-03-03" "2018-03-03" ...
$ CrimeTime :Classes 'hms', 'difftime' atomic [1:251676] 86100 85020 82800 80700 78960 ...
.. ..- attr(*, "units")= chr "secs"
$ CrimeCode : Factor w/ 81 levels "6J","4C","4E",..: 1 2 3 3 3 2 4 3 5 6 ...
$ Location : chr "1000 ASHLAND CT" "2800 W NORTH AVE" "1400 E FORT AVE" "800 E JEFFREY ST" ...
$ Description : chr "LARCENY" "AGG. ASSAULT" "COMMON ASSAULT" "COMMON ASSAULT" ...
$ Inside/Outside : Factor w/ 4 levels "I","O","Outside",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Weapon : chr "NONE" "OTHER" "HANDS" "HANDS" ...
$ Post : int 312 811 943 913 111 415 323 841 721 426 ...
$ District : chr "EASTERN" "SOUTHWESTERN" "SOUTHERN" "SOUTHERN" ...
$ Neighborhood : chr "Oldtown" "Walbrook" "Locust Point" "Brooklyn" ...
$ Longitude : num -76.6 -76.7 -76.6 -76.6 -76.6 ...
$ Latitude : num 39.3 39.3 39.3 39.2 39.3 ...
$ Location 1 : chr "(39.30019, -76.60352)" "(39.30923, -76.66498)" "(39.26944, -76.59492)" "(39.23236, -76.60041)" ...
$ Premise : chr "ROW/TOWNHO" "ROW/TOWNHO" "BARBER/BEA" "ROW/TOWNHO" ...
$ Total Incidents: int 1 1 1 1 1 1 1 1 1 1 ...
$ crime_wt : num 1 7 4.5 4.5 4.5 7 6 4.5 2 7 ...
$ wep_wt : num 0 3 0 0 0 3 0 0 7 0 ...
$ time_wt : num 1 1 1 1 1 1 1 1 1 1 ...
$ overall_wt : num 1 10 4.5 4.5 4.5 10 6 4.5 9 7 ...
write_csv(crime, path = paste0(sub(".csv$", "", save_file), "-WEIGHTED.csv"))
After loading the data into a new Fusion Table, I set the location to use latitude and longitude, chose the heat map, set the weight to overall_wt, and tried various settings for the “radius”. The resulting map was unexpectedly sparse.
Baltimore City Crime Map
It turns out the Fusion Table map is using the Google Maps API heatmap layer, which cannot map more than 1,000 rows. No wonder the data was sparse!
At this point I realized the map wasn’t going to be what I hoped. But, out of curiosity, I adjusted the map by filtering the data. I was able to reduce the number of rows to 998 by selecting only the data occurring between 2018-01-01 and 2018-03-10 and setting a filter to include only the top 6 worst crimes (in my opinion): Homicide, Agg. Assault, Rape, Shooting, Robbery - Carjacking, and Robbery - Residence.
Baltimore City Crime Map - Filtered
Sheesh! Even with only about 2 months and 6 types of crime, Baltimore City at this level looks like a pretty dangerous place (except maybe the Roland Park area).
If I zoom in further on either of these maps to look at the data at a neighborhood level, the individual values become too sparse to be useful. At this zoom level, the safer-looking areas probably look safer because of missing data more than because they are actually safer.
Since the whole point of this is to identify safer areas for a family to live in, I’m going to need a new approach!